fix(registration): jitter cooldown exit and rate-limit registration retries #860
Conversation
jtolentino1 left a comment
LGTM from my testing.
I tested the newer integrated images (cryostat-agent-init:registration-herd-6 and cryostat:4.2.0-registration-herd-5) on OpenShift with 22 injected Agent replicas. The 30+ minute soak stayed stable with 22 ready pods and 22 Cryostat targets, and aliases/connectUrls matched the live Agent pods. Scaling 22 -> 12 -> 22 also converged, and Cryostat tracked the instances correctly.
I also tested the registration behavior directly. Repeated refresh pings returned 204, but the Agent logged the minimum-interval skips instead of rapidly re-registering, and the credential id stayed unchanged. After killing Cryostat with kill 1, several Agents entered cooldown with different jittered durations around the 30s base, and the system recovered back to 22/22 targets after about 3 minutes.
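The staggered recovery described above is the effect of a jittered cooldown. A minimal sketch of the idea, assuming a 30s base and a ±25% jitter fraction (both values are illustrative, not taken from the Agent's actual implementation):

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a jittered cooldown: each Agent sleeps the base duration scaled
// by a random factor, so replicas that all lost the server at the same
// moment do not retry registration in lockstep.
public class JitteredCooldown {
    static final Duration BASE = Duration.ofSeconds(30); // assumed base
    static final double JITTER = 0.25;                   // assumed +/-25% jitter

    static Duration nextCooldown() {
        double factor = 1.0 + ThreadLocalRandom.current().nextDouble(-JITTER, JITTER);
        return Duration.ofMillis((long) (BASE.toMillis() * factor));
    }

    public static void main(String[] args) {
        // every sample must land within the jitter window around 30s
        for (int i = 0; i < 1000; i++) {
            long ms = nextCooldown().toMillis();
            if (ms < 22_500 || ms > 37_500) {
                throw new AssertionError("out of range: " + ms);
            }
        }
        System.out.println("jitter-ok");
    }
}
```

With 22 replicas each drawing an independent cooldown from this window, their re-registration attempts spread out over several seconds rather than arriving as one burst, which matches the "different jittered durations around the 30s base" observed in testing.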
For the Cryostat-side changes, I saw the restart path using periodic discovery jobs with no old discovery.startup jobs left, and the new fault-tolerance rate limits fired for the registration/credential paths during recovery.
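The minimum-interval skips seen in the Agent logs suggest a guard along these lines. This is a hypothetical sketch (the class name, method, and 10s interval are all invented for illustration):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of a minimum-interval guard on registration attempts: an attempt
// arriving sooner than MIN_INTERVAL after the last accepted one is skipped,
// so repeated refresh pings cannot turn into rapid re-registration.
public class RegistrationRateLimiter {
    static final Duration MIN_INTERVAL = Duration.ofSeconds(10); // assumed value
    private Instant lastAccepted = Instant.EPOCH;

    synchronized boolean tryAcquire(Instant now) {
        if (Duration.between(lastAccepted, now).compareTo(MIN_INTERVAL) < 0) {
            return false; // too soon: caller logs a skip instead of re-registering
        }
        lastAccepted = now;
        return true;
    }

    public static void main(String[] args) {
        RegistrationRateLimiter limiter = new RegistrationRateLimiter();
        Instant t0 = Instant.parse("2025-01-01T00:00:00Z");
        System.out.println(limiter.tryAcquire(t0));                 // first attempt accepted
        System.out.println(limiter.tryAcquire(t0.plusSeconds(3)));  // within interval: skipped
        System.out.println(limiter.tryAcquire(t0.plusSeconds(12))); // past interval: accepted
    }
}
```

A skipped attempt leaves the existing registration (and credential id) untouched, which is consistent with the credential id staying unchanged across repeated 204 refresh pings.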
One note: after restart/recovery I did see stale discovery.periodic Quartz jobs logging Plugin not found, and the DB had more periodic jobs/credentials than live plugins, but the visible target state recovered correctly.
Thanks for the detailed analysis @jtolentino1 !
This is "expected" in the current server-side implementation - when the job next runs, it'll cancel its own trigger if it detects that the Target it's set up for has disappeared. After a few minutes the persisted periodic jobs state in the database should settle back to 1:1 with the discovered targets once the system has made a full recovery. Credentials should also eventually settle back to 1:1, but it's not critical if there are stale Credentials left around with 0 matching targets. If that is the case then that's another bug we should fix, but I think that can wait.
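The self-canceling behaviour can be sketched like this. It is a simplified stand-in, not Cryostat's actual job class: the scheduler is modeled as a plain map, and liveTargets/fire are invented names:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of a self-canceling periodic job: when the job fires and its Target
// no longer exists, it unschedules its own trigger, so stale
// discovery.periodic jobs settle back toward 1:1 with live targets.
public class PeriodicDiscoveryJob {
    // invented stand-ins for the target registry and the scheduler's triggers
    static final Set<String> liveTargets = new HashSet<>();
    static final Map<String, Runnable> triggers = new HashMap<>();

    static void fire(String targetId) {
        if (!liveTargets.contains(targetId)) {
            // Target disappeared (e.g. after a server restart): cancel our own trigger
            triggers.remove(targetId);
            return;
        }
        // ... normal periodic discovery refresh for targetId ...
    }

    public static void main(String[] args) {
        triggers.put("plugin-A", () -> fire("plugin-A"));
        triggers.put("plugin-B", () -> fire("plugin-B"));
        liveTargets.add("plugin-A"); // plugin-B is stale

        // iterate over a copy so a job may remove itself while firing
        triggers.values().stream().toList().forEach(Runnable::run);
        System.out.println("remaining=" + triggers.keySet());
    }
}
```

Under this model the stale jobs linger only until their next firing, which matches the "settles back after a few minutes" behaviour described above.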
Depends on #858
See #851
Adds two more behaviours: